How to build phylogenetic species trees with OMA

Knowledge of species phylogeny is critical to many fields of biology. In an era of genome data availability, the most common way to make a phylogenetic species tree is by using multiple protein-coding genes, conserved in multiple species. This methodology is composed of several steps: orthology inference, multiple sequence alignment and inference of the phylogeny with dedicated tools. This can be a difficult task, and orthology inference, in particular, is usually computationally intensive and error prone if done ad hoc. This tutorial provides protocols to make use of OMA Orthologous Groups, a set of genes all orthologous to each other, to infer a phylogenetic species tree. It is designed to be user-friendly and computationally inexpensive, by providing two options: (1) Using only precomputed groups with species available on the OMA Browser, or (2) Computing orthologs using OMA Standalone for additional species, with the option of using precomputed orthology relations for those present in OMA. A protocol for downstream analyses is provided as well, including creating a supermatrix, tree inference, and visualization. All protocols use publicly available software, and we provide scripts and code snippets to facilitate data handling. The protocols are accompanied with practical examples.


Introduction
Inferring accurate and complete species phylogenies is a fundamental problem in biology 1 . Traditionally, species trees have been inferred using ubiquitous marker genes such as the small subunit ribosomal RNA (SSU rRNA 16S/18S) gene 2 . However, as there are fewer sites to sample, using only one gene per species limits the resolution of the inference; the phylogeny of the gene may not necessarily reflect the evolutionary history of the entire species (due to, for example, incomplete lineage sorting, "hidden" duplications followed by loss of one copy, horizontal gene transfer, etc. 3 ). Additionally, phylogenies based on one gene are not always sufficient to obtain statistical support for difficult nodes due to their limited number of characters. For this reason, species phylogenies are now overwhelmingly inferred from multiple genes 4 . As long as one takes the necessary precautions, notably selecting true orthologs for their comparisons (see 4,5 for common pitfalls in phylogenomics), multilocus phylogenies are better resolved and more robust 6 . Recently, multiple protein-coding genes were used to infer a tree comprising ~3000 species, but this was still limited to a small number of concatenated ribosomal genes 7 . With the rise of next-generation sequencing, many hundreds of genes can now be considered when building species trees, which tremendously increases the available information for the inference.
To make use of all available genes, one needs to identify groups of genes that emerged from a common ancestral gene solely through speciation. These sets of genes, in which all pairs of genes are orthologs 8 , are commonly referred to as Orthologous Groups (OGs). Another term used for these types of groups are marker genes, or phylogenetic marker genes. The inference of OGs is non-trivial due to additional evolutionary events such as gene duplications, gene losses or horizontal gene transfers 9 . Furthermore, there are numerous algorithms for inferring orthology which can result in different OG composition, further complicating matters.
In this tutorial, we focus on OGs obtained from Orthologous MAtrix (OMA). Alternatively called "OMA Groups," they are stringently computed orthologs, they make use of all the available species in OMA, and are specifically designed for species tree inference. OMA Groups are defined as gene families that contain genes which are all orthologous to each other, with a maximum of one gene per species 10 . If recently diverged in-paralogs are inferred (i.e., co-orthologs), only one of the copies will be selected for the OG. Thus, all members of the OG are still orthologous to each other. This type of Orthologous Group is provided by only a few orthology databases such as BUSCO 11 and OMA 12,13 and, to our knowledge, only OMA allows for use of both precomputed and user-computed OGs. Moreover, OMA Groups have repeatedly been shown to produce reliable trees 12 and its underlying algorithm was shown to accurately infer OGs in a large-scale benchmark 14 . It is sometimes possible to rely on existing marker genes used in large-scale studies (for example 15 , but they are generally available only for a subset of species and do not include newly sequenced species. The OMA algorithm is freely available as an open source software tool (OMA Standalone 12 ) that integrates well with the public OMA Browser (https://omabrowser.org), which is a database that provides orthology information among more than 2400 genomes across the tree of life, selected to maximize taxon coverage and users' needs 16 . To date (December 2021 release) there are 1719 Bacteria, 155 Archaea, and 622 Eukaryotes. In this protocol, we show how to leverage the publicly available orthology data to infer phylogenetic species trees.
First, we set up the prerequisites and explain how to search for precomputed OGs for species of interest in the OMA database. Then, we show how to infer a species tree under two scenarios: (1) using only species that are present in the public OMA database, or (2) using species in OMA in addition to other proteomes not available in the database, e.g. a proteome obtained from sequencing a new species. By proteome we mean the protein sequences of all protein-coding genes annotated in a genome. Finally, we show how to do downstream processing and tree inference. Each of these steps is illustrated by an example on real data.

Materials
The tools needed for this tutorial can be found in Table 1. All commands can be run from the command line and/or with python scripts. We reference four tree inference software tools, but there are many other alternatives (see List of phylogenetics software).
Two examples will be used in this tutorial, one to illustrate Protocol 1, and another to illustrate Protocol 2. Both examples can be downloaded from FigShare 17 . Instructions on how to obtain the data from the OMA Browser for Protocol 1 are described in the next section, so it is not required to download anything to complete this protocol. However, Protocol 2 demonstrates how to add external proteomes. For this example we chose two proteomes available in the OMA Browser, but we set them aside after downloading the data. We then re-add them as external data (FOMPI.fa and YEAST.fa). For reproducibility, these proteomes can be found in the /data/ AddedGenomes subdirectory of Protocol 2. The rest of the data included in the example tarball are OGs and alignment files needed to compute the trees, which are also included as results.
The tree computations on these data have been performed using both RAxML 8.2.12 and IQ-TREE 1.7.beta17, as specified in the PDF accompanying the examples.

Protocols
Phylogenetic tree inference using OMA is done in three steps: getting OGs data, aligning all sequences of every OG and combining them into a supermatrix, and finally, using tree inference tools on the supermatrix. Depending on the species requirement, two options are available to obtain OG data, they are detailed in Protocol 1 and 2 subsections. Protocol 1 is the fastest and can be used if all species of interest are available on the OMA Browser. Alternatively, Protocol 2 is for the cases when new proteomes must be added, or when solely using data computed by OMA Standalone. Later steps are the same for both cases and are addressed in Protocol 3. Protocol 1: Export marker genes to make a phylogeny of species found in OMA This method is the quickest way to obtain data to build a phylogenetic species tree, but is only useful if one is interested in making a tree from species already in the OMA database. To do this, the Export marker genes function in the browser takes advantage of the precomputed OMA Groups. As mentioned in the Introduction, OMA Groups are a specific type of OG which contain sets of genes that are all orthologous to one another. This implies that there is at most one gene from each species in a group.

Finding species of interest in OMA.
The OMA public database and all related information are accessed through the OMA browser (https://omabrowser.org/). One can search for species of interest in the OMA database by browsing through the available data in OMA using the release info page (from the menu in the upper right corner: Explore -> Release information). Two browsing options are available, the default one is through an interactive tree, with colors indicating domains of life: bacteria are blue, archaea are green and eukaryotes are red. The other option is a table viewer featuring a search bar and both can be accessed through the Select species browser widget icons in the top right of the Species Information visualization.
Export the relevant data from OMA. The way to obtain OGs with only species present in the OMA database is by using the Download -> Export marker genes option ( Figure 1A) in the top right menu. This will open a page which allows the user to select species. Species can be searched by name or clade. A whole clade can be selected by clicking on the node (select all species). A single species can be selected by clicking on the leaf (select species). All selected species will be displayed in the right box with additional species information (release info, taxon id, etc.) ( Figure 1B).

Specifying the Minimum Species Coverage and Maximum Number of Markers parameters.
After species selection, exported OGs will depend on the minimum fraction of covered species and the maximum number of markers parameters: • Minimum species coverage: the lowest acceptable proportion of selected species that are present in any given OG in order to be exported. Figure 1. Exporting data from OMA for building a species tree. A) Choose which type of data to export from the Download tab on the right hand side of the home page. B) Select your proteomes from those in the OMA database by using the interactive species tree, which is based on the NCBI taxonomy.
A more permissive (lower) minimum species coverage will result in a higher number of exported groups. Choosing this parameter depends on the number of and how closely related are the selected species. For instance, consider the Drosophila clade versus chordates clade (20 and 116 species in the January 2020 release, respectively). If one selects the 20 Drosophila genomes and sets the minimum species coverage to 0.5, only OGs with at least 10 Drosophila species will be exported.
In the January 2020 release, this results in 11,855 OGs which meet this criteria. If using the same 0.5 minimum species coverage for the chordates, it results in 14,357 OGs exported. On the other hand, for a 0.8 minimum species coverage, 7,886 and 6,329 OGs are exported for Drosophila and chordates clades, respectively.
• Maximum number of markers: the maximum number of OGs/marker genes to return. To consider as much information as possible in the tree inference, remove any limit by setting this parameter to -1, in which case all OGs fulfilling the minimum species coverage parameter will be returned. To speed up the tree inference, set this value to below 1000 genes. When the number of markers is limited in this way, OGs with the highest coverage will be prioritized.
After filling in the parameters and submitting the request, the browser will return a compressed archive ("tarball") that contains a fasta file with unaligned sequences for each OG. Depending on the size of the request, it may take a few minutes for this operation to complete.
As an example for Protocol 1, we performed an analysis on 20 yeast species, using only OGs shared by all species (Minimum species coverage : 1) and no limit to the number of OGs retrieved (Maximum number of markers : -1). We obtained 169 OGs with this query. The corresponding data, including a list of the 20 species used, can be found at FigShare 17 , in the Protocol_1 folder.
Upon exporting the marker genes, i.e. OGs, from OMA, the data can be used to make a phylogenetic species tree (skip to Protocol 3: Downstream processing and tree inference).
Protocol 2: Export precomputed OMA all-against-all data as a backbone to add your own genomes and use with OMA standalone Orthology computation first starts with an all-against-all alignment phase一comparing all proteins in every species of interest to each other. If genomes to be included in the species phylogeny are not present in OMA (hereafter referred to as "added genomes"), it is necessary to first compute orthology predictions for the combined set of species (those in OMA plus the added genomes). This approach is computationally more expensive and requires that computations are performed on a local machine or high performance computing cluster (HPC). However, by using the OMA Browser's Download -> Export All/All option, one can take advantage of the precomputed all-against-all data for those species in OMA, saving time. The following protocol describes how to make use of this data and run OMA Standalone, the software for running the OMA algorithm on added genomes. In the case where the user wants to only use genomes unavailable in OMA, skip to the "Running OMA Standalone" section.
Export the all-against-all from OMA. Choose species which you want to combine with your own genomes by choosing Download -> Export All/All from the top right menu ( Figure 1A). This will lead to an interactive species tree of all the species in OMA, for which you can choose your species of interest to export ( Figure 1B).
After selecting species and clicking submit, the OMA Browser will export a tarball (described in Figure 2) which contains: • The all-against-all alignments of the selected species, found in the folder "Cache." • All exported genomes, in the format of protein fasta files, found in the folder "DB." • The full OMA standalone software tool. No need to download it separately.
Combining the added genomes with exported OMA data. Next, the added genomes data must be combined with the OMA data. For this procedure, the added genomes data must fulfill certain conditions: • Each additional dataset is in the form of a fasta file, containing protein sequences of all coding genes in the corresponding genome. Please note that OMA Standalone can work on nucleic coding sequences when starting from scratch, however for compatibility with pre-computed OMA data, only protein sequences may be used when combining new and exported data.

•
The name of the fasta file should identify the species clearly and uniquely. The exported genomes from OMA use for example UniProt's mnemonic five-letter species codes. The filename must end with a ".fa" suffix and must not contain any whitespace characters. The filename without the ".fa" suffix is used as the species name throughout the process and result files.
• Each sequence in the fasta file has a clear and unique identifier. We suggest not to use special characters such as brackets, dots, or a pipe character. The reason is that many programs use them for special purposes, e.g. brackets are used in the newick format for tree representation, and the pipe character is often used to separate ids and annotations.
If these conditions are fulfilled, these fasta files must be put into the DB folder with the other exported OMA genomes (Figure 2), where they will be considered as a unique dataset for the following steps.
Setting the parameters for OMA standalone. Before starting the computation, it is wise to adjust the parameters file, called "parameters.drw" (Figure 2), which can be edited with any text editor. If the goal is to only generate a dataset for species phylogeny inference (and not to keep other unrelated orthology inferences, such as Hierarchical Orthologous Groups 10 , which better represent individual genes' evolutionary histories but take time to compute), one can avoid doing computations and generating output files that are not needed by the following: • Uncomment (remove the # from) all the lines starting with WriteOutput EXCEPT #WriteOut-put_OrthologousGroupsFasta := false. By keeping that one commented, OMA standalone will produce one fasta file for each inferred OG. The important files and folders are colored. In green, the executable files mentioned in the course of the tutorial. In blue are the files and folder that will need to be modified. Other files and folders (in black) will not be used in the course of the tutorial. Files and folders not shown are represented by three dots.
• Deactivate the Hierarchical Orthologous Group inference, which is not needed here, by setting DoHierarchicalGroups := false; • Likewise, deactivate the gene function prediction by setting DoGroupFunctionPrediction := false; • Tip: do not omit a semicolon at the end of each uncommented statement.
Running OMA standalone. To run OMA standalone, one needs to be aware that the OMA pipeline can be split into two parts: all-against-all alignments for homology inference and orthology calling. Because OMA can compute Smith-Waterman alignments in parallel for all species which were not exported from OMA (see Export the all-against-all from OMA), it is beneficial to perform the computations on a computer cluster. However, if the dataset is small (e.g. 2-3 additional genomes), the computations can be run locally on a standard computer.
To run OMA standalone on a small dataset locally: 1. Within the extracted tarball folder you can start the computation with the command line: $: bin/oma -n NR_PROCESSES NR_PROCESSES should not be higher than the number of CPUs you have available on your machine.
For a larger dataset, we recommend the use of an HPC cluster. We recommend to break up the computations into two parts: first the all-against-all part, then the orthology inference part: 1. Create a submission script for your cluster. Examples of submission scripts are provided at https://omabrowser.org/standalone/#schedulers and 18.
2. Make sure that the submission script enters the folder into which the tarball was extracted, by either running the script from inside that directory or using the cd command appropriately.
3. The line to start the OMA all-against-all computation in the submission scripts is: The -s option means stop after the all-against-all phase. Since this part can be parallelized, we recommend using job-arrays. For this you need to set the number of parallel jobs as an environment variable (export NR_PROCESSES=100) and use the job-array syntax in the submission script (e.g. in LSF: bsub -J oma[1-$NR_PROCESSES] bin/oma -s). OMA Standalone automatically partitions the work chunks in a static and deterministic way among the specified number of workers. Progress of the entire computation can be checked with the OMA Status command (see below). For environments with limited runtimes/walltimes see https://omabrowser.org/standalone/ #advanced%20optimisations. 4. Check whether the all-against-all computation is finished using: This command will output a file formatted as: Summary of OMA standalone All-vs-All computations: Where the letters A, B, C, D, E and F represent numbers. Once the computations are completed, D should be equal to 0.0%, and both E and F to 100.0% 5. In the case where the jobs are finished but the all-against-all computation is still not complete, use the oma-cleanup and oma-compact commands before re-submitting.
$:bin/oma-cleanup $: bin/oma-compact These commands remove partially finished output files in the Cache/AllAll folder and zip all partial computations that are finished to one file, respectively.
6. Once the all-against-all computation has finished, the final step is the orthology calling. This step is more memory intensive, requires a single process, and can be called with: Once the computation finishes, all results will be stored in the newly-created "Output" folder. In this folder there will be an "EstimatedSpeciesTree.nwk" file that contains a phylogenetic tree that can be visualized using a tree visualization tool such as Phylo.io 19 . This is a distance tree based on the weighted average of the pairwise distances between sequences within the most complete OMA groups. This species tree is a rough estimate that is computed on the fly, and is not the final tree. It can be used as control to identify problems in the dataset but will not be as reliable as the tree inferred using the generated OGs later in this protocol. Therefore, it is recommended to use the OGs to compute your own tree with external software. The OGs (OMA Groups) 20 can be found in the "OrthologousGroupsFasta" folder, with each OG containing at least two species.
Usually for the construction of phylogenetic trees, one would select only OGs that contain at least X% of species, as described above with the parameter Minimum Species Coverage. The python script filter_groups.py from the git repository associated to this publication (https://doi.org/10.5281/zenodo.6037516 21 ) can be used to filter the OMA groups that contain at least X MIN_NR_SPECIES (replace <MIN_NR_SPECIES> and <destination/directory> with your own values): $: python filter_groups.py --min-nr-species <MIN_NR_SPECIES> --input Output/ OrthologousGroupFasta/ --output <destination/directory> For example, we performed an analysis adding two yeast proteomes hypothetically not available in OMA and 18 available yeast proteomes. As a first step, we downloaded the precomputed data for the 18 proteomes from the OMA Browser and launched the computation after adding two separate proteomes. Once the computation finished, we selected 880 OGs that included at least 90% of the 20 species 一 18 一 as a dataset to construct a tree. The data used in this example is available at FigShare 17 in the Protocol_2 folder. This command sequentially generates an MSA file (.aln) for each OG. Depending on the number of OGs and species in your dataset, executing it may take a prohibitive amount of time. If it is the case, we recommend using job-arrays to execute the alignments in parallel. In order to infer the phylogeny of the species from these alignments, they have to be concatenated in a single alignment commonly referred to as supermatrix. We provide a python script to automate this, concat_alignment.py, available on https://doi.org/10.5281/zenodo.6037516 21 .
The --format-output option allows for choosing the output format of this concatenation, either fasta or phylip format (some phylogenetic software requires a specific format as input). Once the python script is downloaded or cloned, ensure that all alignments are in the same folder, and launch using the following command: $: python concat_alignments.py <path>/<to>/<alignments>/*aln --format-output [fasta/phylip] > output After computing the supermatrix, the phylogenetic tree can be inferred using any number of available software. We recommend choosing from the tools in Table 2, sorted by computing time and increasing precision.

Tree visualization.
Most of the current phylogenetic inference tools provide trees in Newick format as output. In order to visualize such a tree, one can use the web-based viewer phylo.io (http://phylo.io) or other tree visualization tools (e.g. FigTree, phylogeny.io, etc). Displaying bootstrap values for internal nodes is recommended to evaluate the confidence of the inferred tree topology.
In our examples, we inferred trees by aligning the sequences with MAFFT, concatenated the alignments using the aforementioned concat_alignments.py, and ran both IQ-TREE and RAxML (Figure 3 and Figure 4). The data used for and the results from the computations can be found on FigShare 17 (alignments in the "data/Alignments" folder, and trees in the "tree" folder). The exact code used for these examples is on 21 .

Discussion and conclusion
With the wealth of genomic data available in an era of high-throughput sequencing, there is much to gain by making phylogenies from concatenations of multiple genes rather than from one single gene. This can better represent the evolutionary history of a clade, because the evolutionary history of a single gene can be

Software for making phylogenetic trees
Example command

IQ-Tree iqtree -s alignment.phy -m LG -T 20 --mem 20G -seed 12345 -bb 1000
RaML raxml-ng --threads 20 --model LG+G8+F --seed 15826 --msa alignment.phy --all --bs-trees 100 PhyloBayes pb_mpi -dc -gtr -cat -dgam 4 -x 10 1000 -d alignment.phy alignment.chain1 misrepresentative of a species evolutionary history. The principle of a supermatrix approach is that by combining multiple genes in one single phylogeny, one can combine the phylogenetic signal of multiple genes. One has to be careful however, to not combine "phylogenetic noise". Orthologs selection is particularly important in this regard 4,23 , because errors in orthology inference could add genes that are not true orthologs, but rather paralogs descending from the ancestral genes by a duplication event. Thus, they would have a different evolutionary history than the sought species phylogeny.
OMA Groups (or Orthologous Groups) are a well-suited set of orthologs for this kind of analysis, as the criteria used to compute these orthologs are stringent. They require that all genes are reciprocally closest genes in their respective species to all the other genes of the group and do not allow more than one gene in a species, thus excluding paralogs. In the Quest for Orthologs Benchmark 14 , the community benchmark for orthology inference, OMA Groups are consistently the most specific inference, although lacking in recall. As potentially missing genes are less detrimental to phylogenetic determination than false predictions are 24 , this is an appropriate choice of orthology inference method for this tutorial. Several phylogenies have already been published using OMA standalone or data from the OMA Browser, including those for archaea, sharks, spiders, worms, and insects, among others 25-29 .
This tutorial demonstrated how to carry out these different steps to infer a phylogenetic tree: orthology determination, sequence alignments, supermatrix construction, and phylogeny inference. It is designed to allow users to leverage the state of the art orthology inference provided by OMA Groups while reducing the necessary computation from their side, namely by relying on precomputed all-against-all alignments provided by the OMA Browser. We include code snippets and scripts that automate the whole process, and ensure reproducibility of all phylogenetic analyses following this protocol. The tutorial is accompanied by practical examples with all data available on GitHub and Figshare. This tutorial is designed to help users generate a species tree phylogeny on reliable data, by relying on the least amount of computation. Nevertheless, we advise care in interpretation of the obtained species tree. In particular, even with accurate selection of orthologs, non-phylogenetic noise may persist in the data, making some branches hard to resolve. It is exemplified in this tutorial by a few differences between the species tree produced by the two different protocols, likely due to difference in the number of genes used. To avoid misinterpretation of the data, it is wise to compute and report measures of bipartition consistency, like bootstrap support values 30 , while generating a species tree. A low bootstrap value will flag bipartitions that are subject to phylogenetic noise and that cannot be asserted with confidence. In our examples, bipartitions that differ between protocols have relatively low bootstrap values in the species tree.
For more information about the theory behind phylogenomics and the different methods, we refer the reader to recent reviews 31-33 . In the context of this tutorial, we used well-established MSA and phylogenetic tree inference tools. For the more difficult cases however, it advised to carefully choose which tool to use, including some tools which are not mentioned here. For more information about existing tools the readers are invited to turn to the relevant literature 32,34 . The protocols described here can be adapted to suit any other software compatible with standard data formats.

Data availability
The imported OG data and the OMA standalone software can be obtained from the OMA Browser (https://omabrowser.org), following instructions in this tutorial.

InBioS -PhytoSYSTEMS, Eukaryotic Phylogenomics, University of Liège, Liège, Belgium
This tutorial illustrates how to use the online OMA Browser and OMA Standalone package to assemble datasets of single-copy orthologous genes for phylogenomic inference. It covers two use cases: 1) all species of interest are present in the OMA Browser and 2) some species are not available and must thus be added by the user using OMA Standalone.
Generally speaking, the structure and writing are clear, the figures and tables are appropriate, and the data and code underlying the examples are well described and made available in two public repositories (Figshare and Zenodo). Therefore, I do not have major comments. However, I think that it is possible to easily improve a number of minor points in the manuscript. I provide a list below, simply sorted by order of occurrence. There are no line numbers, so I try to provide some context when needed. "16 rRNA ribosomal genes" should rather be "small subunit ribosomal RNA (SSU rRNA 16S/18S) gene" (no plural) to encompass the eukaryotic gene as well. ○ "among more than 2300 genomes across the tree of life": You might provide a few numbers here: bacteria, archaea, non-animals, non-plants -or at least recall the rationale for genome inclusion in OMA Browser because it is quite incomplete in terms of species diversity.
○ "using species in OMA in addition to other genomes not available in the database": It is not clear if OMA Standalone requires conceptual translations of the sequences (i.e., proteins) or if it can translate DNA sequences on the fly; if it cannot, please specify that one needs protein sequences and not genomes to make use of OMA.
○ "phylogenetic tree inference softwares": I am not a native English speaker, but "software" is uncountable; maybe you mean "software packages"? The interface of OMA Browser has slightly changed since the manuscript has been written. Two details that would need updating: "in the upper left of the home screen" is incorrect [for "Explore"] and "Compute" has now become "Download".
○ "To speed up the tree inference, set this value to below 1000 genes": One wonders how these <1000 genes would be prioritized and selected; this should be mentioned here.
○ "The name of the fasta file is either the species name or the 5-letter UniProt identifier...": This is not specific enough; what about whitespace, dots in strain names, etc? Are they allowed? Similarly, you should better define "other special characters" just below.
○ "...and not keeping other unrelated orthology inferences": This is a bit mysterious. I am aware that OMA can do more than what you demonstrate here, but this sentence should probably be slightly expanded with some examples.
○ "...use the job-array syntax in the submission script": This is the part where reproducibility is not straightforward. In particular, is there a master OMA instance controlling the parallelization or is this left to the user? From the code snippet provided, it is not clear, albeit the output of the oma-status command suggests that there is some orchestrator freeing the user from this chore. Please explain a bit more.
○ "we selected 880 OGs with at least 18 species as a dataset": Considering Figure 4, I understand your point: 90% of 20 species is 18. However, since you start with 18 genomes and add two "new" ones, the re-use of the number "18" is confusing to the user. As the analyses are done, I would not suggest to change this number, but rather to explain directly the 90% rationale in the text.
○ "This command generates a MSA file (.aln) for each OG.": This "for" loop is a very fine candidate for parallelization. Please remind the user that they could use a job array for this. Otherwise, they may wait a lot of time staring at the shell.
○ "there is much to gain by making phylogenies from multiple gene families": Here, I am strongly opposed to the words "gene families". The whole point of OMA is to identify singlecopy (orthologous) genes, not gene families (i.e., including paralogues). Please use instead "from concatenations of multiple genes, rather than from one single gene" (or something similar).
○ "because the evolutionary history of a single gene can be misrepresentative of a species evolutionary history": This is said twice in the manuscript (and this is true) but not backed up by specific reasons. Please enumerate a few of them between parentheses.

○
In the legend of Figure 3 and 4, please provide the evolutionary model used. I know it is in Table 2, but somewhat hidden in the command line arguments of the software packages.
○ "in this case both trees have identical topology" [ Figure 4]: This is correct, but this topology is widely different from the two topologies of Figure 3. This shows that phylogenetic artefacts are at play (which is well known for yeasts...) and this might worth mentioning in the sentences about the caveats of phylogenomics. In other words, it is a pity that this use case does not provide a clear phylogenetic solution for the selected genomes.

Natasha Glover, Swiss Institute of Bioinformatics, Lausanne, Switzerland
We thank the reviewer for their very helpful, clear, and pertinent remarks. Each original comment is in italics and our responses are in bold.

"see 3 for common pitfalls in phylogenomics": Don't you mean Philippe et al. (2017) "Pitfalls in supermatrix phylogenomics1" instead?
The previous reference addressed most of the same issues as the one the reviewer suggests, but the latter is more recent and comprehensive. We now cite both.

"OGs can have at most one representative gene per species": You should probably say a word about recently diverged in-paralogs (which do not harm species phylogenies) because exactly single-copy genes barely exist.
We have now updated the text to describe how in-paralogs are handled: "If recently diverged in-paralogs are inferred (i.e., co-orthologs), only one of the copies will be selected for the OG. Thus, all members of the OG are still orthologous to each other." 4. "among more than 2300 genomes across the tree of life": You might provide a few numbers here: bacteria, archaea, non-animals, non-plants -or at least recall the rationale for genome inclusion in OMA Browser because it is quite incomplete in terms of species diversity.
We added more details on species selection and species breakdown on major clades. We also updated the numbers to be in line with the latest release of OMA: "...provides orthology information among more than 2400 genomes across the tree of life, selected to maximize taxon coverage and users' needs ( We added precision on the input format specification for Protocol 2. Since it needs to be compatible with precomputed OMA data, only protein sequence may be used for this Protocol. We updated the text at the end of the Introduction to say "using species in OMA in addition to other proteomes not available in the database, e.g. a proteome obtained from sequencing a new species. By proteome we mean all protein or nucleic acid sequences of protein-coding genes annotated in a genome." We also changed the later text in "Combining the added genomes with exported OMA data" to read: "For this procedure, the added genomes data must fulfil certain conditions: Each additional dataset is in the form of a fasta file, containing protein sequences of all coding genes in the corresponding genome. Please note that OMA Standalone can work on nucleic coding sequences when starting from scratch, however for compatibility with the pre-computed OMA data, only protein sequences may be used when combining new and exported data. " 6. "phylogenetic tree inference softwares": I am not a native English speaker, but "software" is uncountable; maybe you mean "software packages"?
We corrected this error, following the reviewer's suggestion. Thanks, corrected.
9. The interface of OMA Browser has slightly changed since the manuscript has been written. Two details that would need updating: "in the upper left of the home screen" is incorrect [for "Explore"] and "Compute" has now become "Download".
We changed the text to be adapted to the latest version of the browser. We updated Figure 1 as well to reflect the new menus and design.
10. "To speed up the tree inference, set this value to below 1000 genes": One wonders how these <1000 genes would be prioritized and selected; this should be mentioned here.
When the number of markers is limited, Orthologous Groups are selected in decreasing order of species representation. This ensures the genes with representatives in most of the selected species will be prioritized. We adapted the text to reflect this. We agree, this could be clarified. We now changed the text to say: "The added genomes data must fulfill certain conditions: -The name of the fasta file should identify the species clearly and uniquely. The exported genomes from OMA use for example UniProt's mnemonic five-letter species codes. The filename must end with a ".fa" suffix and must not contain any whitespace characters. The filename without the ".fa" suffix is used as the species name throughout the process and result files.
-Each sequence in the fasta file has a clear and unique identifier. We suggest not to use special characters such as brackets, dots, or a pipe character. The reason is that many programs use them for special purposes, e.g. brackets are used in the newick format for tree representation, and the pipe character is often used to separate ids and annotations." We try to make this more clear by now saying: "OMA Standalone automatically partitions the work chunks in a static and deterministic way among the specified number of workers. Progress of the entire computation can be checked with the OMA Status command (see below)." 14. "we selected 880 OGs with at least 18 species as a dataset": Considering Figure 4, I understand your point: 90% of 20 species is 18. However, since you start with 18 genomes and add two "new" ones, the re-use of the number "18" is confusing to the user. As the analyses are done, I would not suggest to change this number, but rather to explain directly the 90% rationale in the text.
We reformulated a bit to make the reasoning more explicit. We now say: "Once the computation finished, we selected 880 OGs that included at least 90% of the 20 species --18 --as a dataset to construct a tree." 15. "This command generates a MSA file (.aln) for each OG.": This "for" loop is a very fine candidate for parallelization. Please remind the user that they could use a job array for this. Otherwise, they may wait a lot of time staring at the shell.
We now add a reminder that using job arrays is a recommended solution with big datasets. We changed the text to say: "This command sequentially generates a MSA file (.aln) for each OG. Depending on the number of OG and species in your dataset, executing it may take a prohibitive amount of time. If it is the case, we recommend using job-arrays to execute the alignments in parallel." 16. "there is much to gain by making phylogenies from multiple gene families": Here, I am strongly opposed to the words "gene families". The whole point of OMA is to identify single-copy (orthologous) genes, not gene families (i.e., including paralogues). Please use instead "from concatenations of multiple genes, rather than from one single gene" (or something similar).
We agree this phrase is misleading because even if OMA also provides orthology at the level of whole gene families (HOGs) we are using OGs here -single-copy orthologs. We changed the text as suggested by the reviewer.
17. "because the evolutionary history of a single gene can be misrepresentative of a species evolutionary history": This is said twice in the manuscript (and this is true) but not backed up by specific reasons. Please enumerate a few of them between parentheses.
We now enumerate some reasons and link to a reference on the first time this is mentioned, in the introduction.
18. In the legend of Figure 3 and 4, please provide the evolutionary model used. I know it is in Table 2, but somewhat hidden in the command line arguments of the software packages.
We now provide the evolutionary models used in both figures. The legends now say "Comparison of phylogenetic trees computed by IQ-TREE, using an LG substitution model (left), and RAxML, using an LG substitution model, a discrete Gamma model of rate heterogeneity with 8 categories, and empirical amino-acid frequencies (right)..." 19. "in this case both trees have identical topology" [ Figure 4]: This is correct, but this topology is widely different from the two topologies of Figure 3. This shows that phylogenetic artefacts are at play (which is well known for yeasts...) and this might worth mentioning in the sentences about the caveats of phylogenomics. In other words, it is a pity that this use case does not provide a clear phylogenetic solution for the selected genomes.
We thank the reviewer for pointing this out. It is true that there are some differences between the species tree obtained from the two protocols and that it must be addressed. The difference corresponds to bipartitions that are not well supported in either species tree but this was not easily apparent in the first version of the manuscript.
We made two major change to address this comment: We changed Figures 3 and 4 to show the least statistically bipartitions in the species tree.

○
In the introduction, we chose to mention only OMA and BUSCO, because they are to our knowledge the only maintained software that provides single-copy orthologs (does not include any paralogs). This is a prerequisite to use the supermatrix approach for species tree inference.
2. Please discuss the advantage of OMA compared with other tools.
We added a line indicating the main advantage of OMA compared to other methods: it is possible to use a precomputed dataset, and also to mix it with local genomic data, which makes it usable in numerous contexts. We now state:

Could you integrate the steps of analyses to one steps with multiple parameters?
As the scope of the paper is to provide a tutorial, we do not plan on integrating the steps into only one command line script.
One of the main reasons to not integrate it into a single command line is to leave more choice to the user regarding the non-OMA tools they may use, as well as their parameters. Given the high number of possible parameters at any step, it would make designing and using such a script highly complicated.
A second reason is that such an implementation will need to make technical choices (e.g regarding parallelization) which would make it incompatible with some computing infrastructures.